# ------------------------------------------------------------------------------------------
# # Introduction to Queryverse.jl
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# This notebook has all the code required to follow the tutorial by [David
# Anthoff](https://www.david-anthoff.com/) on 6/14/2018. This notebook _is not_ a self
# contained introduction to the
# [Queryverse.jl](https://github.com/davidanthoff/Queryverse.jl), instead it is meant to
# complement the video tutorial.
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ## Logistics
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide3.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide4.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ## End-to-end example
# ------------------------------------------------------------------------------------------

using Queryverse, VegaDatasets, IndexedTables

# ------------------------------------------------------------------------------------------
# The `dataset` function from the
# [VegaDatasets.jl](https://github.com/davidanthoff/VegaDatasets.jl) package returns a
# number of useful example datasets.
# ------------------------------------------------------------------------------------------

cars = dataset("cars")

# ------------------------------------------------------------------------------------------
# __WARNING__: The following cell will not run on juliabox, you need to run it on a local
# Jupyter instance.
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Piping the `cars` dataset into
# [DataVoyager.jl](https://github.com/davidanthoff/DataVoyager.jl) opens a UI for data
# exploration and plotting.
# ------------------------------------------------------------------------------------------

cars |> Voyager()

# ------------------------------------------------------------------------------------------
# The default view shows a plot per column, picking an appropriate style based on the data
# type of the column.
# ![alt](screenshots/voyager1.png)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Dragging the quantitative wildcard fields on both the x and y channel shows as correlation
# plots.
# ![alt](screenshots/voyager2.png)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# We can use [Query.jl](https://github.com/davidanthoff/Query.jl) operators to filter the
# `cars` dataset.
# ------------------------------------------------------------------------------------------

cars |>
@filter(_.Origin=="USA")

# ------------------------------------------------------------------------------------------
# The file IO features of the Queryverse allow us to save the filtered dataset as a CSV
# file.
# ------------------------------------------------------------------------------------------

cars |>
@filter(_.Origin=="USA") |>
save("us_cars.csv")

# ------------------------------------------------------------------------------------------
# We can also load CSV files from disc, and directly create a plot for that data with
# [VegaLite.jl](https://github.com/fredo-dedup/VegaLite.jl).
# ------------------------------------------------------------------------------------------

load("us_cars.csv") |>
@vlplot(:point, x=:Miles_per_Gallon, y=:Weight_in_lbs, color="Cylinders:n")

# ------------------------------------------------------------------------------------------
# ## File IO
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Queryverse provides a unified file IO API for a large number of tabular file formats.
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide7.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ### Load
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide8.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Load a CSV file:
# ------------------------------------------------------------------------------------------

load("data/cars.csv")

# ------------------------------------------------------------------------------------------
# Load a feather file:
# ------------------------------------------------------------------------------------------

load("data/cars.feather")

# ------------------------------------------------------------------------------------------
# Load an Excel file (note how we have to specify the sheet we want to load):
# ------------------------------------------------------------------------------------------

load("data/cars.xlsx", "cars")

# ------------------------------------------------------------------------------------------
# We can also specify the exact range in the Excel file we want to load:
# ------------------------------------------------------------------------------------------

load("data/cars.xlsx", "cars!C1:G41")

# ------------------------------------------------------------------------------------------
# We can also load various stats files:
# ------------------------------------------------------------------------------------------

load("data/cars.dta")

load("data/cars.sas7bdat")

load("data/cars.sav")

# ------------------------------------------------------------------------------------------
# ### Save
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide9.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Remember, the `cars` variable is a table we previously loaded:
# ------------------------------------------------------------------------------------------

cars

# ------------------------------------------------------------------------------------------
# We can save it as a CSV file with this syntax:
# ------------------------------------------------------------------------------------------

save("cars1.csv", cars)

# ------------------------------------------------------------------------------------------
# We can also pipe the data into the `save` function:
# ------------------------------------------------------------------------------------------

cars |> save("cars2.csv")

# ------------------------------------------------------------------------------------------
# Depending on the file format, we can specify various options, for example the delimiter
# character for CSV files:
# ------------------------------------------------------------------------------------------

cars |> save("cars3.csv", delim=';')

# ------------------------------------------------------------------------------------------
# We can also save as a feather file:
# ------------------------------------------------------------------------------------------

cars |> save("cars4.feather")

# ------------------------------------------------------------------------------------------
# ### Combine load and save
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# We can also load a file in one format and directly pipe it into the `save` function and
# save it in another format. Note that the data will never be materialized as say a
# `DataFrame` in that case.
# ------------------------------------------------------------------------------------------

load("data/cars.sas7bdat") |> save("cars5.csv")

# ------------------------------------------------------------------------------------------
# ## Tables
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide11.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide12.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide13.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Note how the load function returns a different type for each kind of file we can load. All
# of them implement the TableTraits.jl interface.
# ------------------------------------------------------------------------------------------

load("data/cars.csv") |> typeof

load("data/cars.feather") |> typeof

dataset("cars") |> typeof

# ------------------------------------------------------------------------------------------
# The constructors of most table types accept any table, and will automatically convert the
# table at construction. For example, the following code loads a feather file. The call to
# `load` returns a `FeatherFile` instance, which implements the TableTraits.jl interface.
# When we pass that to the `DataFrame` constructor, it automatically constructs a
# `DataFrame` from the data in the file.
# ------------------------------------------------------------------------------------------

DataFrame(load("data/cars.feather"))

# ------------------------------------------------------------------------------------------
# We can also pipe things into a `DataFrame`:
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |> DataFrame

# ------------------------------------------------------------------------------------------
# We can also pipe things into an `IndexedTable`:
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |> table

# ------------------------------------------------------------------------------------------
# Because both `DataFrame` and `IndexedTable` implement the TableTraits.jl interface, we can
# also do silly things like 1) load a feather file, 2) convert it into an `IndexedTable`,
# then 3) convert that `IndexedTable` into a `DataFrame`.
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |> table |> DataFrame

# ------------------------------------------------------------------------------------------
# It works the other way around as well:
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |> DataFrame |> table

# ------------------------------------------------------------------------------------------
# Because [VegaLite.jl](https://github.com/fredo-dedup/VegaLite.jl) accepts data as any
# TableTraits.jl table, we can plot `DataFrame`s:
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |> DataFrame |> @vlplot(:point, x=:Horsepower, y=:Acceleration)

# ------------------------------------------------------------------------------------------
# Or we can plot `IndexedTable`s
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |> table |> @vlplot(:point, x=:Horsepower, y=:Acceleration)

# ------------------------------------------------------------------------------------------
# Or we can skip the table types entirely and directly plot the file:
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |> @vlplot(:point, x=:Horsepower, y=:Acceleration)

# ------------------------------------------------------------------------------------------
# We can also do lots of other combinations. Here we load a stats file into an
# `IndexedTable`:
# ------------------------------------------------------------------------------------------

it = load("data/cars.sas7bdat") |> table

# ------------------------------------------------------------------------------------------
# Then we convert it into a `DataFrame`:
# ------------------------------------------------------------------------------------------

df = it |> DataFrame

# ------------------------------------------------------------------------------------------
# We can pipe the `IndexedTable` into Voyager:
# ------------------------------------------------------------------------------------------

it |> Voyager()

# ------------------------------------------------------------------------------------------
# Or we can pipe the `DataFrame` into Voyager:
# ------------------------------------------------------------------------------------------

df |> Voyager()

# ------------------------------------------------------------------------------------------
# All the file IO functions work with anything that implements the TableTraits.jl interface,
# so we can e.g. save both an `IndexedTable` and a `DataFrame` as a feather file:
# ------------------------------------------------------------------------------------------

it |> save("cars8.feather")

df |> save("cars9.feather")

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide14.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ## The pipe
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide16.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# We can pipe in one line:
# ------------------------------------------------------------------------------------------

load("data/cars.csv") |> @filter(_.Origin=="Europe") |> DataFrame

# ------------------------------------------------------------------------------------------
# Or we can write each step on its own line:
# ------------------------------------------------------------------------------------------

load("data/cars.csv") |>
    @filter(_.Origin=="Europe") |>
    DataFrame

# ------------------------------------------------------------------------------------------
# The following example doesn't return anything at the end because the `save` function has
# no return value:
# ------------------------------------------------------------------------------------------

load("data/cars.csv") |>
    @filter(_.Origin=="Europe") |>
    save("cars10.csv")

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide17.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# But sometimes we want to continue the pipe, for example to display the table at the
# interactive prompt:
# ------------------------------------------------------------------------------------------

load("data/cars.csv") |>
    @filter(_.Origin=="Europe") |>
    @tee(save("cars10.csv"))

# ------------------------------------------------------------------------------------------
# Or to save the data as a CSV File and then create a plot:
# ------------------------------------------------------------------------------------------

load("data/cars.csv") |>
    @filter(_.Origin=="Europe") |>
    @tee(save("cars10.csv")) |>
    @vlplot(:point, x=:Acceleration, y=:Horsepower)

# ------------------------------------------------------------------------------------------
# We can also pipe the plot to the `save` function, i.e. it is not just for tabular data!
# ------------------------------------------------------------------------------------------

load("data/cars.csv") |>
    @filter(_.Origin=="Europe") |>
    @tee(save("cars10.csv")) |>
    @vlplot(:point, x=:Acceleration, y=:Horsepower) |>
    save("fig1.png")

# ------------------------------------------------------------------------------------------
# If we want to save the plot in multiple formats and see it at the end, we can use repeated
# `@tee` calls:
# ------------------------------------------------------------------------------------------

load("data/cars.csv") |>
    @filter(_.Origin=="Europe") |>
    @tee(save("cars10.csv")) |>
    @vlplot(:point, x=:Acceleration, y=:Horsepower) |>
    @tee(save("fig1.png")) |>
    @tee(save("fig1.pdf"))

# ------------------------------------------------------------------------------------------
# ## Queries
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ### Basics
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide19.PNG)
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |> @filter(_.Origin=="Europe") |> @orderby(_.Horsepower)

# ------------------------------------------------------------------------------------------
# Queries are not restricted to tabular data.
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide20.PNG)
# ------------------------------------------------------------------------------------------

1:8 |> @filter(_%2==0) |> @orderby_descending(_)

# ------------------------------------------------------------------------------------------
# If we pipe things into `collect` at the end, we generate an array:
# ------------------------------------------------------------------------------------------

1:8 |> @filter(_%2==0) |> @orderby_descending(_) |> collect

# ------------------------------------------------------------------------------------------
# We can also start from a `Dict`:
# ------------------------------------------------------------------------------------------

Dict(:a=>4, :b=>2, :c=>8)

# ------------------------------------------------------------------------------------------
# And filter the `Dict`:
# ------------------------------------------------------------------------------------------

Dict(:a=>4, :b=>2, :c=>8) |> @filter(_[2]>3)

# ------------------------------------------------------------------------------------------
# And collect the pairs into an array:
# ------------------------------------------------------------------------------------------

Dict(:a=>4, :b=>2, :c=>8) |> @filter(_[2]>3) |> collect

# ------------------------------------------------------------------------------------------
# Or we can collec them into another `Dict`. Not that this will only work if the pipe
# generates a stream of `Pair`s:
# ------------------------------------------------------------------------------------------

Dict(:a=>4, :b=>2, :c=>8) |> @filter(_[2]>3) |> Dict

# ------------------------------------------------------------------------------------------
# We can also start from a generator:
# ------------------------------------------------------------------------------------------

(i=>i^2 for i in 1:10) |> @filter(_[2]>3) |> @orderby(_[1]) |> collect

# ------------------------------------------------------------------------------------------
# ### @filter
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# `@filter` removes elements from the input stream based on some condition.
# ------------------------------------------------------------------------------------------

1:10 |> @filter(_>5)

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide21.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# You don't have to use the anonymous function shorthand, you can also use normal anonymous
# functions (but why would you, more typing!):
# ------------------------------------------------------------------------------------------

1:10 |> @filter(i -> i>5)

# ------------------------------------------------------------------------------------------
# ### @map
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# `@map` applies a transformation function to each element from the input stream. Here we
# square every element:
# ------------------------------------------------------------------------------------------

1:10 |> @map(_^2)

# ------------------------------------------------------------------------------------------
# Here we create a `Pair` for each input element, so that we can pipe things into a `Dict`:
# ------------------------------------------------------------------------------------------

1:10 |> @map(_=>_^2) |> Dict

# ------------------------------------------------------------------------------------------
# Here we extract the `Origin` column for the `cars` table:
# ------------------------------------------------------------------------------------------

cars |> @map(_.Origin)

# ------------------------------------------------------------------------------------------
# #### Brief detour about named tuples
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# First, lets store the stream of named tuples in an array:
# ------------------------------------------------------------------------------------------

r = cars |> collect

# ------------------------------------------------------------------------------------------
# We then access the first named tuple from that array:
# ------------------------------------------------------------------------------------------

nt = r[1]

# ------------------------------------------------------------------------------------------
# And look at the type:
# ------------------------------------------------------------------------------------------

typeof(nt)

# ------------------------------------------------------------------------------------------
# We can use the `fieldnames` function to inquire about the names of the fields of the named
# tuple:
# ------------------------------------------------------------------------------------------

fieldnames(nt)

# ------------------------------------------------------------------------------------------
# We can access a field with the `.` notation:
# ------------------------------------------------------------------------------------------

nt.Origin

# ------------------------------------------------------------------------------------------
# Or by indexing, the way we would index into a normal tuple:
# ------------------------------------------------------------------------------------------

nt[3]

# ------------------------------------------------------------------------------------------
# So the example from above extracts the `Origin` field for each row it gets, at which point
# we have a stream of `String`s.
# ------------------------------------------------------------------------------------------

cars |> @map(_.Origin)

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide22.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# We can construct named tuples with the `{}` syntax. Note that this only works within
# Query.jl macros. Whenever we create named tuples in a `@map`, we are creating a table,
# because a table is a stream of named tuples:
# ------------------------------------------------------------------------------------------

1:10 |> @map({foo=_, bar=_^2})

# ------------------------------------------------------------------------------------------
# Lets remember the columns of the `cars` dataset:
# ------------------------------------------------------------------------------------------

cars

# ------------------------------------------------------------------------------------------
# We can extract only two columns:
# ------------------------------------------------------------------------------------------

cars |> @map({_.Name, _.Year})

# ------------------------------------------------------------------------------------------
# We can extract two columns and rename one:
# ------------------------------------------------------------------------------------------

cars |> @map({_.Name, Foo=_.Year})

# ------------------------------------------------------------------------------------------
# Because we are producing a table with the `@map` command, we can save things as a CSV
# file:
# ------------------------------------------------------------------------------------------

cars |> @map({_.Name, Foo=_.Year}) |> save("cars11.csv")

# ------------------------------------------------------------------------------------------
# We can also construct `Pair`s from a table and then store things as a `Dict`:
# ------------------------------------------------------------------------------------------

cars |> @map(_.Name => Date(_.Year)) |> Dict

# ------------------------------------------------------------------------------------------
# ### @take and @drop
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# `@drop(n)` drops the first `n` elements from the input stream:
# ------------------------------------------------------------------------------------------

1:10 |> @drop(3)

# ------------------------------------------------------------------------------------------
# `@take(n)` takes the first `n` elements from the input stream:
# ------------------------------------------------------------------------------------------

1:10 |> @drop(3) |> @take(4)

# ------------------------------------------------------------------------------------------
# That of course also works with tables. To view only row 4-7 we can do:
# ------------------------------------------------------------------------------------------

cars |> @drop(3) |> @take(4)

# ------------------------------------------------------------------------------------------
# ### Sorting
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# The `@orderby` operators accepts an anonymous function that extracts a key for each input
# element. The elements are then sorted by that key. For example, to sort the `cars` dataset
# by the `Origin` column, we can write this code:
# ------------------------------------------------------------------------------------------

cars |> @orderby(_.Origin)

# ------------------------------------------------------------------------------------------
# We can also sort in descending order:
# ------------------------------------------------------------------------------------------

cars |> @orderby_descending(_.Origin)

# ------------------------------------------------------------------------------------------
# There are lots of rows that have the same value for `Origin`. We can use the `@thenby`
# command to specify a secondary sort key:
# ------------------------------------------------------------------------------------------

cars |> @orderby(_.Origin) |> @thenby(_.Year)

# ------------------------------------------------------------------------------------------
# You can chain as many `@thenby` operators as you want:
# ------------------------------------------------------------------------------------------

cars |> @orderby(_.Origin) |> @thenby(_.Year) |> @thenby_descending(_.Horsepower)

# ------------------------------------------------------------------------------------------
# The key extraction function is not restricted to just returning a value, it can also call
# arbitrary julia functions. For example, here we sort the `cars` dataset by the character
# length of the `Origin` column:
# ------------------------------------------------------------------------------------------

cars |> @orderby(length(_.Origin))

# ------------------------------------------------------------------------------------------
# ### @groupby
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide23.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Here we group by the `Origin` column:
# ------------------------------------------------------------------------------------------

cars |> @groupby(_.Origin)

# ------------------------------------------------------------------------------------------
# We can look at the keys of each group:
# ------------------------------------------------------------------------------------------

cars |> @groupby(_.Origin) |> @map(_.key)

# ------------------------------------------------------------------------------------------
# We can also look at how many elements there are in each group:
# ------------------------------------------------------------------------------------------

cars |> @groupby(_.Origin) |> @map(length(_))

# ------------------------------------------------------------------------------------------
# We can also extract e.g. the third element for each group:
# ------------------------------------------------------------------------------------------

cars |> @groupby(_.Origin) |> @map(_[3])

# ------------------------------------------------------------------------------------------
# In the next example we are creating a table that shows us how many elements we have in
# each group as a table:
# ------------------------------------------------------------------------------------------

cars |> @groupby(_.Origin) |> @map({Origin=_.key, Count=length(_)})

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide24.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# The anonymous function that returns the group key can call any julia function. For
# example, we can group things by the length of the `Origin` column:
# ------------------------------------------------------------------------------------------

cars |> @groupby(length(_.Origin)) |> @map({OriginLength=_.key, Count=length(_)})

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide25.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# We can also supply a second anonymous function that gets applied to each element from the
# source before that element gets placed into a group. This can be helpful if we want to
# take e.g. the mean for each group:
# ------------------------------------------------------------------------------------------

cars |> @groupby(_.Origin, _.Acceleration) |> @map({Origin=_.key, MeanAcceleration=mean(_)})

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide26.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide27.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# The following example shows an alternative way to computing the mean of a given column for
# each group:
# ------------------------------------------------------------------------------------------

cars |> @groupby(_.Origin) |> @map({Origin=_.key, MeanAcceleration=mean(_..Acceleration)})

# ------------------------------------------------------------------------------------------
# This alternative style is especially useful if we want to compute aggregates for multiple
# columns:
# ------------------------------------------------------------------------------------------

region_stats = cars |>
    @groupby(_.Origin) |>
    @map({
        Region=_.key,
        MeanAcceleration=mean(_..Acceleration),
        MinCylinders=minimum(_..Cylinders)
    })

# ------------------------------------------------------------------------------------------
# ### @join
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Joins combine elements from two sources into one stream of elements.
# ------------------------------------------------------------------------------------------

cars

region_stats

# ------------------------------------------------------------------------------------------
# Here we pick all row combinations where the `Origin` and `Region` column match, and then
# do some computations:
# ------------------------------------------------------------------------------------------

cars |>
    @join(region_stats, _.Origin, _.Region, {_.Name, reg_ex_acc=_.Acceleration - __.MeanAcceleration}) |>
    @orderby_descending(_.reg_ex_acc)

# ------------------------------------------------------------------------------------------
# ### When is processing happening?
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# The following line does zero data processing because we don't display the results (note
# the `;` at the end of the line):
# ------------------------------------------------------------------------------------------

q = load("data/cars.feather") |>
    @filter(_.Origin=="USA") |>
    @map({_.Name, _.Cylinders}) |>
    @take(5);

# ------------------------------------------------------------------------------------------
# Only when we start to iterate the results is any data read from disc and processed:
# ------------------------------------------------------------------------------------------

for row in q
    println(row)
end

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide28.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# The following example is completely streaming. Each row is read from disc, filtered,
# projected and saved into the CSV file before the next row is read. In some sense this is
# resembling loop fusion.
# ------------------------------------------------------------------------------------------

load("data/cars.feather") |>
    @filter(_.Origin=="USA") |>
    @map({_.Name, _.Cylinders}) |>
    save("cars12.csv")

# ------------------------------------------------------------------------------------------
# ## Plotting & visual exploration
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# Lets look at the `cars` dataset again:
# ------------------------------------------------------------------------------------------

cars

# ------------------------------------------------------------------------------------------
# We can pipe it into a plot. We must first specify the kind of `mark` we want to use. In
# this example we are using a point mark. Vega-Lite will draw a point for each row in our
# dataset. Because we haven't specified any encodings, all points are drawn on top of each
# other:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point)

# ------------------------------------------------------------------------------------------
# We can start to make the plot more useful by encoding the `x` position of the points to
# the `Miles_per_Gallon` column:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point, x=:Miles_per_Gallon)

# ------------------------------------------------------------------------------------------
# If we also encode the `y` channel of the mark, we start to get a useful plot:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point, x=:Miles_per_Gallon, y=:Acceleration)

# ------------------------------------------------------------------------------------------
# We can further split things up by coloring each point based on the content of the `Origin`
# column:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point, x=:Miles_per_Gallon, y=:Acceleration, color=:Origin)

# ------------------------------------------------------------------------------------------
# We can also facet the plot so that we generate one plot per `Origin` value. We now use the
# `Cylinders` column for the color of the points:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point, x=:Miles_per_Gallon, y=:Acceleration, column=:Origin, color=:Cylinders)

# ------------------------------------------------------------------------------------------
# Note how Vega-Lite automatically picked a continous color scale for `Cylinders` because
# that column is a numeric column. If we instead want things drawn as a nominal legend, we
# can specify the type of the encoding with a shorthand syntax:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point, x=:Miles_per_Gallon, y=:Acceleration, column=:Origin, color="Cylinders:n")

# ------------------------------------------------------------------------------------------
# Another type of encoding is an `ordinal` encoding:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point, x=:Miles_per_Gallon, y=:Acceleration, column=:Origin, color="Cylinders:o")

# ------------------------------------------------------------------------------------------
# We could also use different shapes of points for each `Origin`:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point, x=:Miles_per_Gallon, y=:Acceleration, shape=:Origin, color="Cylinders:o")

# ------------------------------------------------------------------------------------------
# The plots are a little small, so lets increase the size:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:point, x=:Miles_per_Gallon, y=:Acceleration, shape=:Origin, color="Cylinders:o", height=300, width=300)

# ------------------------------------------------------------------------------------------
# Another type of plot is a histogram:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:bar, x={:Miles_per_Gallon, bin=true}, y="count()")

# ------------------------------------------------------------------------------------------
# We can easily split this up per `Origin` with the `column` channel:
# ------------------------------------------------------------------------------------------

cars |> @vlplot(:bar, x={:Miles_per_Gallon, bin=true}, y="count()", column=:Origin)

# ------------------------------------------------------------------------------------------
# Using Query.jl, we can manipulate all sorts of information into a table and then plot it:
# ------------------------------------------------------------------------------------------

readdir() |> @groupby(length(_)) |> @map({length=_.key, count=length(_)}) |> @vlplot(:bar, x=:count, y="length:o")

# ------------------------------------------------------------------------------------------
# Lets assign a plot to the variable `p`
# ------------------------------------------------------------------------------------------

p = cars |> @vlplot(:bar, x={:Miles_per_Gallon, bin=true}, y="count()", column=:Origin)

# ------------------------------------------------------------------------------------------
# And look at the type of `p`:
# ------------------------------------------------------------------------------------------

typeof(p)

# ------------------------------------------------------------------------------------------
# Plots can be saved in lots of different image formats:
# ------------------------------------------------------------------------------------------

p |> save("foo.png")

p |> save("foo.pdf")

p |> save("foo.svg")

p |> save("foo.eps")

# ------------------------------------------------------------------------------------------
# __WARNING__: The following line only works with `master` of FileIO.jl right now. Hopefully
# a new version will be tagged in a few days.
#
# We can also save a plot as a `.vegalite` file, the native Vega-Lite JSON format:
# ------------------------------------------------------------------------------------------

p |> save("foo.vegalite")

# ------------------------------------------------------------------------------------------
# Voyager uses Vega-Lite under the hood, so the two packages are tightly integrated. Lets
# pipe the `cars` dataset into Voyager again and create a plot in Voyager:
# ------------------------------------------------------------------------------------------

v = cars |> Voyager

# ------------------------------------------------------------------------------------------
# ![alt](screenshots/voyager3.png)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# We can now access the current plot in the Voyager window with the `[]` syntax:
# ------------------------------------------------------------------------------------------

v[]

# ------------------------------------------------------------------------------------------
# Note that this just returns a plot instance, as if we had created it with the `@vlplot`
# macro:
# ------------------------------------------------------------------------------------------

typeof(v[])

# ------------------------------------------------------------------------------------------
# We can therefore save it as files:
# ------------------------------------------------------------------------------------------

v[] |> save("foo2.pdf")

# ------------------------------------------------------------------------------------------
# __WARNING__: The following cells only works with the `master` branch of FileIO.jl.
#
# Here we can save the plot as a Vega-Lite file.
# ------------------------------------------------------------------------------------------

v[] |> save("foo.vegalite")

# ------------------------------------------------------------------------------------------
# We can then later load that file again, which returns a standard plot instance, as if we
# had created it with the `@vlplot` macro:
# ------------------------------------------------------------------------------------------

load("foo.vegalite")

# ------------------------------------------------------------------------------------------
# We can use this for interesting manipulations. For example, we can load a saved plot, but
# then pipe different data into it to create an updated plot:
# ------------------------------------------------------------------------------------------

cars |>
    @filter(_.Origin in ("USA", "Japan")) |>
    load("foo.vegalite")

# ------------------------------------------------------------------------------------------
# ## Missing values
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# The Queryverse uses [DataValues.jl](https://github.com/davidanthoff/DataValues.jl) for its
# missing value representation.
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ## Conclusion
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide32.PNG)
# ------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------
# ![alt](slides/Slide33.PNG)
# ------------------------------------------------------------------------------------------
